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Abstract 

Demographic change of human populations is one of the central questions for delving into the past of human beings. To 
identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of 
the non-recombining region, discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates 
can be estimated much more precisely using a molecular clock. We found that all the Paleolithic divergences were binary; 
however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago) (assuming a constant substitution rate 
of 1 x10~'/bp/year) indicates that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at 
that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be 
related to the development of agriculture. 
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introduction 

Demographic change is one of the central questions in 
understanding human history, and strong population expansions 
may be linked to various events as climate changes, alteration of 
social structure, or technological innovations. The recent advent of 
next-generation sequencing technology enabled systematic analy- 
sis of the population history using the information from the whole 
genome with less ascertainment bias, so we can re-assess how the 
various factors have influenced the human population size and 
structure [1,2]. Recent analyses of mitochondrial genomes 
revealed that the expansions of female lineages of East Asians 
[3] and those of Europeans [4] started before the Neolithic Era, 
contradictory to the hypothesis that the agricultural innovation 
constitutes the primary driving force of population expansions [5]. 
These observations prompted this study to investigate expansions 
of male lineages. 

The Y chromosome contains the longest non-recombining 
region (~60 Mbp, in which ~10 Mbp is unique sequence in the 
genome and easy to analyze) in the human genome [6,7], making 
it an informative tool for reconstructing genetic relationship of 
human populations and paternal lineages, and dating important 
evolutionary and demographic events [8,9,10,11]. However, the 
sequencing data of Y chromosomes of human populations were 
insufficient and biased even for those of current 1000-genome 
project for which coverage on Y chromosome was low (on average 
<1.4x in East Asian samples) [12]. 



According to the phylogenetic tree of Y chromosome, all the 
modern males could be categorized into 20 major monophyletic or 
paraphyletic groups (referred to as A to T) and their subclades 
[13,14]. Nearly all the Y chromosomes outside Africa are 
derivative at the SNP Ml 68 and belong to any of its three 
descendent super-haplogroups - DE, C, and F [9,10,15], strongly 
supporting the out-of-Africa theory. The time of the anatomically 
modern human's exodus from Africa has yielded inconsistent 
results ranging from 39 kya [16], 44 kya [10], 59 kya [17], 68.5 
kya [18] to 57.0-74.6 kya [19]. 

To achieve sufTiciently high coverage in the non-recombining 
regions of Y chromosome (NRY) and an adequate representation 
of individual samples, we selected 110 males, encompassing the 
haplogroups O, C, D, N, and which are common in East 
Eurasians, as well as haplogroups J, G, and R which are common 
in West Eurasians (see Table SI), and sequenced their non- 
repetitive segments of NRY using a pooling-and-capturrng 
strategy. 

Results 

Overall ~4,500 base substitutions were identified in all the 
samples from the whole Y chromosome, in which >4,300 SNPs 
that has not been publicly named before 2012 (ISOGG etc.). We 
designated each of these SNP a name beginning with T' (for 
Fudan University) (see Table S2). We obtained -3.90 Mbp of 
sequences with appropriate quality (at least 1 x coverage on >100 
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out of 1 10 samples, see Table S3), and identified ~ 3,600 SNPs in 
this region. A maximum parsimony phylogenetic tree of the 78 
individuals with good coverage was reconstructed (Fig. 1 and Fig. 
SI), the topology of which is congruent with the existing tree of 
human Y chromosome [13,20]. The tree contained samples from 
haplogroups C, D, G, J, N, O, Q, and R, and thus represented all 
the three super-haplogroups out of Africa - C, DE and F. In 
addition to the known lineages, many new downstream lineages 
were revealed. AU the earlier divergences were found to be 
bifurcations, except for three star-like structures, i.e. multiple 
lineages branching off from a single node, were observed under 
Haplogroup 03a-M324, indicating strong expansion events. 

By using Bayesian method [2 1] with a constant mutation rate of 
1x10 ^ substitution/base /year [7,19] (or one substitution per 2.56 
years on 3.9 Mbp range, without considering the uncertainty of 
mutation rate), we calculated the date of each divergence event 
throughout the tree. The first divergence event out of Africa, i.e. 
between Haplogxoup DE and the ancestor of C and F, is dated at 
54.1 kya (95%CI 50.6-58.2), inside the range of previous 
estimations. Within the 3.9 Mbp range, only 3 SNPs were 
observed between the divergence events of DE/CF and C/F, 
indicating that DE, C, and F likely emerged subsequently in less 



than a thousand years. After diverged from Haplogroup C, no 
major split was observed in F for 18 thousand years, suggesting a 
strong bottleneck of F lineage. It should be noted that all the 
primary haplogroups (G, J, N, O, Q, and R) emerged before the 
last glacial maximum (LGM, ~20 kya), and most of the presendy 
known East Eurasian clades have branched off in the late Upper 
Paleolithic Age (before 10 kya). All divergences on this tree before 
7 kya were binary, suggesting that during the Paleolithic Age, slow 
population growth and bottlenecks or drift eradicated most of the 
ever existing clades [22]. 

The most surprising discovery in the tree is the three star-like 
expansions in Haplogroup 03-M324, i.e. under the Ml 17 clade, 
the M134xM117 paragroup, and the 002611 clade. Here we 
denote the three star-like expansions as Oot, 0(3, and Oy, 
respectively (see Discussion). Since the sample selection for high- 
throughput sequencing was intended for representing a wide 
variety of clades in East Asian populations, a star-like expansion 
indicates successful expansion of male lineages within a very short 
period (<500 years). These three clades are present with high 
frequency across many extant East Asian populations [23,24] and 
encompass more than 40% of the present Han Chinese in total 
(estimated 16% for Oot, 11% for Ofi, and 14% for Oy) [20]. It is 




33.0 

(30.9-35.21 

k-mK 




1J-M429 



0 ^242 , 

J-M304^ 



[ 



54.1 

(50.6-58.2) 



35.8- 

(33-3-38.3) 



O-M201 



OF.-YAP 



n-MI74 



-CP 



Pan troglodytes 



Figure 1. Phylogenetic tree of Kiuman Y chromosome, emphasizing the three star-lil(e expansions (Oa, Op, Oy). The tree was 
constructed from 78 samples sequenced in this study, together with three published East-Asian genomes and a chimpanzee genome. The branch 
lengths (horizontal lines) are proportional to the number of SNPs on the branch. Numbers in red indicate the coalescence time (in years, considering 
the variation in SNP counting, but ignoring uncertainty in mutation rate) and 95% confidence intervals of the node. For more details, see Fig. SI. 
doi:1 0.1 371/journal.pone.01 05691 .gOOl 
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conspicuous that roughly 300 miUion extant males are the 
patrilineal progenies of only three males in the late Neolithic Age. 
The expansion dates are estimated 5.4 kya for Oof, 6.5 for 0(3, 

and 6.8 for Oy (Fig. 1), after the shift to intensive agriculture in 
North China (since 6.8 kya) [25,26], in particular, during the 
Yangshao Culture (6.9-4.9 kya) in Central Yellow River Basin, 
Majiayao Culture (6.0-4.9 kya) in the Upper YeUow River Basin, 
and the Beixin (7.4—6.2 kya) - Dawenkou Culture (6.2-4.6 kya) in 
the Lower Yellow River Basin [2 7] . We therefore propose that in 
the late Neolithic Age, the three rapidly expanding clans 
estabhshed the founding patrilineal spectrum of the predecessors 
in East Asia. Since all the sequenced Han Chinese Ml 17+ samples 
are under the Oa expansion, and Ml 17+ subclade exists in 
moderate to very high frequency in many Tibeto-Burman ethnic 
groups [28,29,30], it would be of interest to know when the 
M 1 1 7+ individuals in other ethnic groups diverged with the ones 
in Han Chinese, and whether they are also under the Oa 
expansion, in order to trace the origin and early history of Sino- 
Tibetan language family. 

This study shows that all the strongly expanding Y chromo- 
somal haplogroups (i.e. 0-M175 or C-M130) had already 
migrated to East Asia more than 20 thousand years before their 
Neohthic expansion, thus supporting a boom of local farmers in 
China, which is consistent with the independent origin of 
agriculture [31], while difiFering from the case in Europe, where 
immigrant farmers from the Middle East contributed to the 
majority of modern Y chromosomes [32]. 

Discussion 

Although most of the sequences in this study were obtained 
from individuals in China, the haplogroup representation (C, D, 
G, J, N, O, Qj and R) already enabled us to calculate the times of 
most of the major divergence events outside of Africa, like G/IJK, 
NO/P etc., since the times were achieved using the hypothesis of 
molecular clock, and the results of divergence time between 
haplogroups would not be affected by from whichever continent or 
country the individuals were sampled. One good sequence from 
each of two haplogroups is enough for calculating their divergence 
time, and more sequences could only help to enhance the precision 
but would not gready change the result. 

The significant improvement of accuracy of dating in this study 
comparing to former East Asian studies is attributed to the large 
number of newly discovered SNPs. It is noted that the relative 
standard deviation of calculated divergence time is in inverse 
proportion to the square root of observed SNP occurrence (see 
Discussion SI). Furthermore, the averag(; counts of SNPs from the 
common ancestor of CF/DE to a modern individual is 210 in this 
study, limiting the theoretical 95%CI to only ±13.6%, comparing 
to 9 SNPs on average in the previous study with the 95% CI over 
±60% [16]. Considering that 3.9 Mbp range constitutes only less 
than half of 10 Mbp non-repetitive region in Y chromosome [7], 
the time resolution of east Asian Y chromosome phylogeny is 
expected to be doubled in the near future. 

The determination of mutation rate is a crucial question in 
calculation of the absolute divergent times, which caused the most 
dating differences among the studies [7, 3 3]. As revealed by 
previous studies, this inconsistency of mutation rate was resulted 
from two aspects: among different regions of the chromosome, and 
between older and younger time scales. The former has been 
disclosed in a study of autosomes, that the base substitution rate of 
CpG bases is 9.5-fold that of non-CpG bases [34], as well as for 
mitochondrial DNA, the substitution rate was not only differen- 
tiated between coding and control regions, but also in a base-by- 



base manner [35]. It is worth to point out that recently, Wei et al. 
published a similar study about Y chromosome sequencing of 36 
individuals (mainly Haplogroup Rib and Elb), in which 3.15 or 
8.83 Mbp range was sequenced [19], and they achieved a time of 
out-of- Africa at 57-74 kya using various methods, which is slighdy 
older than our result (54 kya), although the same mutation rate of 
lxlO~^ substitution/base/year were employed. The difierence 
could be ascribed to the regions chosen for date estimation; we 
compared the regions that Wei et al. and we studied, and found 
that in their study, the SNP density in the region that was 
sequenced only in their study is significantly higher than that in the 
region that both studies have sequenced (P<0.005) (Table S4). 

The difference between long-term (evolutionary) and short-term 
(genealogical) mutation rates has also been observed before. For 
calculating the divergence time using Y-chromosomal STR (short 
tandem repeat), the father-son mutation rate is about three times 
the "evolutionary" [36]; similar rate difference was also observed 
for mitochondrial nucleotide sul)stitution rate [35,37]. This 
controversy is usually explained by selection on deleterious 
mutations [38]. The autosomal genealogical substitution rate was 
estimated at 1.2x10 substitution/base/generation [34,39], 
which is less than half of th(" rat(' we used in this study. However, 
due to that 80-85 '/o of de novo mutations are attribut(;d to the 
father's side [34], and that the Y chromosome contains the least 
genes among the chromosomes and thus underwent lessened 
purifying selection [40], the mutation rate used in this study is stiU 
compatible with the previous studies. 

We also compared human and chimpanzee Y chromosomes 
(see SI Methods), and found ~45,800 substitutions between the 
two species which fall into the range that we compared for human 
samples; roughly 1/4 intra-human SNPs have no homologous loci 
on chimpanzee Y chromosome. Assuming the divergence between 
human and chimpanzee was at ~6,000 kya [41] and a constant 
substitution rate, the divergence time for DE and CF would be 
onh' 40 kya, which is younger than our result. This suggests that 
base-substitution rate between human and chimpanzee is higher 
than the rate inside human species, which can be explained with 
the huge interspecies difiFerence of the Y-chromosome structures, 
and the observation that the chimpanzee Y chromosomal genes 
decayed faster than human [40]. 

To overcome the factors for uncertainty of mutation rate, a 
calibration with series of samples of comparable time scales might 
be used. For the case of mitochondrial DNA, a recent study, in 
which several C-14 calibrated ancient complete sequences (4—40 
kya) were incorporated into the tree, made the absolute dates 
much more convincing [42], and we expect a parallel calibration 
for the Y chromosome in the near future. 

Despite of the mutation rate uncertainty, we evaluate our 
calculation of absolute divergence time as acceptable. Firsdy, our 
out-of-Africa date (54. 1 kya) is still within the range of previous 
estimations (39-74.6 kya). Secondly, the out-of-Afric:a date is 
similar to the recent estimation of two great mitochondrial 
expansions outside Africa - M (49.6 kya) and N (58.9 kya) [43]. 
Thirdly, it is not contradictory to the emergence of earliest modern 
human fossil out of Africa (e.g. ~ 50 kya in Australia) [44]. 

The accumulative substitution count from the DE/CF diver- 
gence to a modern individual varies from to 168 (YCHl 13) to 241 
(YCH198). Despite of this variation, by testing the assumption of 
molecular clock for the tree, the nuU hypothesis of a molecular 
clock could not be rejected (P>0.05) using PAML package v4.4 
[45] with the GTR model, unlike the mitochondrial tree from 
complete sequences, which showed violation to the clock 
assumption [43] . Part of the branch length variation may come 
from the false negative detection of SNPs, especially on a long 
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terminal branch; however, this efTect was mostly eliminated that 
we chose only the sequences with good quality for time estimation, 
so the branch length difference for these sequences should mainly 
reflects the real variation, and should have little effect in time 
estimation. 

The current Y haplogroups were named according to the rule of 
Y Chromosome Consortium (YCC) [14]. Along with increasing 
clades being discovered, the present nomenclature became 
cumbersome in some cases, e.g. 'Rlblb2ala2d3a' in the ISOGG 
tree 2010 (http://ww\v.isogg.org), which is prone to frequent 
name changes and hard to remember. Another commonly used 
nomenclature such as 'O-Ml 17' or '0-F46' is also not suitable for 
a determined star-like expansion, since there are many SNPs 
found ancestral to the star point, while these SNPs may be found 
not all equivalent in the future, e.g. some individuals might be 
found M 11 7-H but not belonging to the star expansion, then the 
name of the star point must be renewed. Therefore, here we 
propose a modification to the current nomenclature system: for 
any important star-like expansion that leads to large population 
(e.g. several millions) and multiple lineages (S5) in short time as 
revealed by long-range sequencing (> 1 Mbp was needed in order 
to limit the expansion within 1,000 years), a lineage name with 
lowercase Greek letter is appUed directly after the Latin capital 
letter of the first-class haplogroup name. For example, the star-like 
expansions under Ml 17, F46, and Fl 1 are now named as Oa, OP, 
and Oy, respectively, and their downstream lineages should stiU be 
named following the rule of YCC 2002, with Arabic number 
succeeding the Greek letter, e.g. Ootlal. These names of the star- 
like expansions are not bound to any single defining SNP (e.g. 
Ml 17), but to the expansion itself, i.e. the expansion names should 
always keep unchanged despite new side clades would be found to 
its upstream, in order to keep the nomenclature stable. For the 
currently equivalent SNPs on the branch leading to the expansion, 
we will know the occurring order only after vast amount of 
samples being genotyped for those SNPs. 

Since all the Paleolithic divergences of Y chromosome hneages 
are binary, the three roughly contemporaneous star-like expan- 
sions revealed in this study indicate a remarkable demographic 
change in the late Neolithic Age. The earliest agriculture in North 
China emerged before 10 kya [46], however, no distinct Y 
chromosomal expansion could be related to this event. The three 
star-like expansions happened several thousand years later, thus 
are likely linked to middle Neolithic cultures such as Yangshao 
(6.9-4.9 kya) and Dawenkou Culture (6.2^.6 kya) in the Yellow 
River Basin [27]. During this period, agriculture became mature 
and intensive, and the majority of human diet shifted from food 
collection into production [47,48]. Crop harvest constituted a 
more stable food source than hunting and gathering, and enabled 
nourishing population at higher density. In addition, liberation of 
males from hazardous hunting might have enhanced male viability 
into adulthood, thus the eflective population size of Y chromo- 
some increased. Besides the progress in agriculture, changes in 
social structure might also contribute to the patrilineal expansion. 
In the middle and late phases of Yangshao and Dawenkou culture, 
the burial customs showed a gradual transition from an egalitarian 
matrilineal society into a hierarchical patrilineal one [49,50]. 
Interestingly, the major maternal expansions in China shown by 
mitochondrial tree (among which are also several star-shaped 
expansions) occurred much earlier, at the late Paleolithic Age [3] . 
This immense non-synchrony between maternal and paternal 
expansion suggests a possible transition of social structure, that in 
the late Neolithic Age, a few paternal lineages achieved greater 
advantage on the existing basis of the population that started 
expansion since the Paleolithic Age. After the strong Neolithic 



expansions, the reproductive advantage of the farmers lasted for 
4,000 years, until most of the gatherer-and-hunter tribes in the 
Yellow River Basin were absorbed by the farming societies of 
Huaxia, from which the Han ethnicity was formed. 

Although without ancient DNA proofs, we cannot yet confirm 
the initial expanding regions of these three clans, whether they 
were original in the middle or lower reach of Yellow River Valley 
or migrated from the vicinity, we are now at least certain that a 
majority of Han Chinese did derive from just a few patrilineal 
ancestors in the Neolithic Age. Whether each of them could be 
related to the legendary' Emperors Yan and Huang or their tribes, 
is to be solved with more prudence and with the help of 
interdisciplinary genetic, archeological, ethnical, and documentary 
studies. 

Methods 

Ethics Statement 

The study was under the approval of the Ethics Committee of 
Biological Research at Fudan University. All the samples were 
collected with the informed consent signed by the sample donors. 

Samples 

We collected whole blood from ~800 Chinese male volunteers. 
Genomic DNA was extracted using QIAamp DNA Blood Mini 
Kit (QIAGEN, Hilden, NRW, Germany). SNaPshot multiplex kit 
(ABI, Carlsbad, CA, US) was used for typing Y chromosomal 
SNPs according to the most recent phylogenetic tree [13,20], and 
17 Y-STRs were determined with Y-filer kit (ABI, Carlsbad, CA, 
US). We selected 110 samples for next-generation sequencing, 
considering Y haplogroup, STR haplotype, as well as ethnic 
origin, in order to represent a broad spectrum of Y chromosome 
lineages of Chinese populations (Table SI). The selected samples 
covered most sublineages of Haplogroup O (72 samples), as well as 
Haplogroup C, D, G, J, N, Q, and R. 

Library preparation 

Genomic DNA of the selected samples were sheared using 
Bioruptor UCD-200 (Diagenode, Liege, Belgium) to 200-250 bp 
length, then were fixed to blunt-end, added 3' -A tail, and ligated 
with barcode-linked lUumina paired-end adaptors (Table SI). 
Ligation products were amplified by PCR, and 300 - 350 bp 
sections were extracted through agarose gel electrophoresis. 
Except for one sample (YCH53), the others were pooled into 8 
pools, with 10-15 samples in equal amount in each pool (Table 
SI). NRY was enriched using custom designed bait library (see 
below) of G3360-90000 SureSelect kit for lUumina pairc-d-end 
(Agilent, Santa Clara, CA, US) (the baits were Usted in Table S5). 
After another round of amplification, the pools went through 
single-end or paired-end sequencing with either GAIIx or 
HiSeq2000 sequencer for 100 or 2x100 cycles (lUumina, San 
Diego, CA, US). 

Bait design 

For Agilent SureSelect enrichment, bait library was designed 
with the following procedures: we first simulated reads mapping by 
generating 70-bp fragments of reference Y chromosome (hgl8 or 
NCBI build36) (http://hgdownload.cse.ucsc.edu/downloads. 
html^^human) for each 10 bp, e.g. chrY:l-70, chrY:ll-80 etc. 
The fragments were then mapped on the complete hgl8 genome 
using soap2 aligner (http://soap.genomics.org.cn/)[51]. All the 
match results with 0-2 mismatch bases on all chromosomes were 
summed up, and only the fragments without any repetitive 
matches (on either Y or other chromosome) were kept as unique 
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fragments. The range of those unique fragments was combined, 
and the combined ranges that are at least 240 bp long were 
selected for bait d(-sign on Agilent eArray website (https: / /earray. 
chem.agilent.com/earray/). Totally 40,379 baits covering 
4,292,864 bp were successfully designed and ordered for produc- 
tion. The ranges (on hgl8) of the generated baits are listed in 
Table S5. 

Processing of next-generation sequencing data 

The barcodes were removed and the reads were assigned to 
each sample. For paired-end sequencing, the reads were assigned 
only when the both barcodes were the same. The reads were 
mapped to hglB using bwa aligner (version 0.5.8) [52], and sam 
files were generated. Reads that were uniquely mapped on Y 
chromosome were extracted and transformed into bam file with 
samtools (version 0.1.8) [53]. Duplicates were removed by either 
Picard's MarkDuplicaie (http:/ /picard.sourceforge.net) (for single- 
end) or samtools rmdup (for paired-end). Indels were re-aligned 
using GATK [54,55], and after samtools mpileup, variations were 
called under the following criteria: for one sample, the position 
where the alternative allele (compared to hg 1 8) must be ^ 2 x 
coverage and at the same time ^3/4 of total coverage. All the 
variance candidates were collected, and genotypes were called on 
all the sequenced samples. Out of those candidates, SNPs were 
semi-manuaUy filtered considering consistency to the Y chromo- 
somal phylogeny, coverage (especially for the private SNPs, a 
minimum of 4x was required), and flanking sequence (to avoid 
those included or next to a homopoh mer or an STR). Three other 
publicly available East Asian genomes, YanHuang (YH) (Olal- 
P203) [56], KoRef (SJK) (02b-M176) [57], and GMIAKl 
(03a2c*-P164xM134) [58] were also included in analysis. 

Time estimation of the nodes in the phylogenetic tree 

A coverage filter was applied for time estimation, i.e., only the 
loci with good coverage among the sequenced samples, i.e., with 
more than 100 out of 110 SNP calling results with an 
unambiguous 0 or 1 were selected for phylogenetic reconstruction 
(0 for same as reference, 1 for mutation, question mark "?" when 
neither reference or alternative counts for more than 3/ 4 for this 
sample, and a minus mark "-" for no coverage. Beside these, "x" 
for tested 1 but should be 0, and "@" for tested ? but should be 1. 
The "x" and "@" were manually determined according to tree 
topology and the pattern of barcode confusion. See Table S6). 
SNPs were extracted into pseudo-sequences, and a maximum 
parsimony tree- of only good- and moderate-quality sequences was 
calculated using ARB program [59] . 

To avoid uncertainty in downstream hranclies that might 
influence branch lengths, we used only good-quality sequences for 
Bayesian time estimation (with on average >6x coverage at 
targeted regions and no obvious mislabeling, see Table SI). We 
used BEAST [21] for calculating the divergence time of each node 
in the phylogenetic tree. All the 47 high-quality sequences together 
with YH and SJK were used for time estimation. We generated 
pseudo-sequences from these 49 individuals, with the above 
described loci (>100 out of 110 with unambiguous result) that 
have polymorphism among the 49 individuals, 3823 bases in total. 
For missing or ambiguous data, we imputed the result following 
the tree topology (for the very rare case that imputation doesn't 
work, we assigned a random base from the two possible allele). We 
determined appropriate DNA substitution model with MrModelt- 
est 2.3 [60] for subsequent Bayesian MCMC analysis. For 
Bayesian MCMC analysis, the times of each cluster were estimated 
using BEAST 1.6.1 [21,61]. Each MCMC sample was based on a 
run of 20 million generations sampled every 10,000 steps with the 



first 2 million generations regarded as burn-in. To test the 
assumption of molecular clock for the tree, we used PAML 
package v4.4 with the GTR model. The null hypothesis of a 
molecular dock cannot be rejected (P>0.05) by comparison 
between the models. We used the GTR model of nucleotide 
substitution determined with MrModeltest 2.3 with a strict clock. 
The single nucleotide substitution rate was set as 1x10 ^/ 
nucleotide/year. The effective sample size of the coalescent prior 
was above 900. A relaxed clock was also employed for comparison 
and the results were similar. 

Chimpanzee genome (panTro3) [62] (http://hgdownload.cse. 
ucsc.edu/ downloads.html#chimp) was used as comparison for 
time estimation. AH the base substitutions of chimpanzee genome 
comparing to hglB were discovered using the similar method as 
for bait design: the simulated 100-bp-long reads at each 10 bp 
were generated and mapped onto hgl8 using hwa, and SNPs were 
discovered. The SNPs intra human beings were also called for the 
chimp reads. The root for the human samples in this study was 
thus determined. 

Supporting Information 

Discussion SI Supporting Discussions. 

(DOCX) 

Figure SI Phylogenetic tree of human Y chromosome. 

The tree is constructed from 78 samples sequenced in this study, 
together with three published East-Asian genomes (YH, SJK, 
GMIAKl) and a chimpanzee genome (Pan), which are labeled 
with '*'. Except for YCH145 (Spanish), SJK and GMIAKl (both 
Korean), all the human samples are Chinese. The branch lengths 
(horizontal lines) are proportional to the number of SNPs on the 
branch, and the SNP numbers are labeled under the branches). 
The SNPs labeled on the horizontal lines are only representative. 
The SNPs labeled in green represent newly recognized clades in 
this study. The estimated coalescence time (in years) for the nodes 
are calculated only from good-quality (> 6x coverage) human 
sequences (in bold italic) by BEAST with relaxed clock (see SI 
Methods), and the numbers in brackets are for 95% confidence 
intervals (ignoring uncertainty in mutation rate). 
(EPS) 

Figure S2 Revised migration routes of modem human, 

(a) Split of the first out-of-Africa ancestor and early migration in 
Asia, (b) The emergence of the main haplogroups before and 
during the LGM. (c) Foundation of present haplogroup distribu- 
tion before 8 kya. (d) Major population expansion events in East 
Asia (shaded) in the Neolithic Age and their probable relationship 
with modem language families. 
(EPS) 

Table SI Samples list. 

(XLS) 

Table S2 Newly named SNPs. 
PCLS) 

Table S3 Ranges that were covered by at least 100 of all the 110 

samples. 

PCLS) 

Table S4 Comparison of SNPs discovered in this study and Wei 

et al. (2012). 

PCLS) 

Table S5 Bait regions designed for Agilent SureSelect capturing 

(positions are for chrY of hgl8). 
PCLS) 
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Table S6 Genotyping results. 
(XLS) 
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